A Framework for Symmetric Part Detection in Cluttered Scenes

Tom Lee, Sanja Fidler, Alex Levinshtein, Cristian Sminchisescu, and Sven Dickinson 


Abstract —The role of symmetry in computer vision has waxed and waned in importance during the evolution of the field from its 
earliest days. At first figuring prominently in support of bottom-up indexing, it fell out of favor as shape gave way to appearance 
and recognition gave way to detection. With a strong prior in the form of a target object, the role of the weaker priors offered by 
perceptual grouping was greatly diminished. However, as the field returns to the problem of recognition from a large database, 
the bottom-up recovery of the parts that make up the objects in a cluttered scene is critical for their recognition. The medial axis 
community has long exploited the ubiquitous regularity of symmetry as a basis for the decomposition of a closed contour into 
medial parts. However, today’s recognition systems are faced with cluttered scenes, and the assumption that a closed contour 
exists, i.e., that figure-ground segmentation has been solved, renders much of the medial axis community’s work inapplicable. In 
this article, we review a computational framework, previously reported in [1], [2], [3], that bridges the representation power of the 
medial axis and the need to recover and group an object’s parts in a cluttered scene. Our framework is rooted in the idea that a 
maximally inscribed disc, the building block of a medial axis, can be modeled as a compact superpixel in the image. We evaluate 
the method on images of cluttered scenes. 

Index Terms —symmetry; medial axis; perceptual grouping; object recognition 
1 Introduction 

SYMMETRY is a long-standing, interdisciplinary form that spans the arts and sciences,
covering fields as disparate as mathematics, biology, architecture, and music [4]. The
roles played by symmetry are equally diverse, and can involve being an abstract object
of analysis, a balancing structure in nature, or an attractor of visual attention. The
common thread in all of the above is that symmetry is ubiquitous in both natural and
artificial objects. It is no accident that we constantly encounter symmetry through our
eyesight; in fact, Gestalt psychologists [5] of the previous century proposed that
symmetry is a physical regularity in our world that has been exploited by the human
visual system to yield a powerful perceptual grouping mechanism. Experiments show
evidence that we respond to symmetry before being consciously aware of it [6].

The scope of this article lies within the domain of computer vision, a comparatively
young field that has adopted symmetry since its infancy. Inspired by a computational
understanding of human vision, perceptual grouping played a prominent role in support
of early object recognition systems, which typically took an input image and a set of
shape models, and identified which of the models was visible in the image. Mid-level
shape priors were crucial in grouping causally related shape features into
discriminative shape indices that were used to prune the set down to a few promising
candidates that might account for a query. Of these shape priors, one of the most
powerful is a configuration of parts, in which a set of related parts belonging to the
same object is recovered without any prior knowledge of scene content.

The use of symmetry to recover generic parts from an image can be traced back to the
earliest days of computer vision, and includes the medial axis transform (MAT) of Blum
(1967) [7], generalized cylinders of Binford (1971) [8], superquadrics of Pentland
(1986) [9], and geons of Biederman (1985) [10], to name just a few examples. Central to
a large body of approaches based on medial symmetry is the MAT, which decomposes a
closed 2D shape into a set of connected medial branches corresponding to a
configuration of parts, providing a powerful parts-based decomposition of the shape
suitable for shape matching, e.g., Siddiqi et al. (1999) [11] and Sebastian et al.
(2004) [12]. For a definitive survey on medial symmetry, see Siddiqi et al. (2008) [13].

In more recent years, the field of computer vision has shifted in focus toward the
object detection problem, in which the input image is searched for a specific target
object. One reason for this lies in the development of machine learning algorithms that
can leverage large amounts of training data to produce robust classification results.
This led to rapid progress in the development of object detection systems, enabling
them to handle increasing levels of background noise, occlusion, and variability in
input images [14]. This development established the standard practice of working with
input domains of real images of cluttered scenes, significantly increasing the
applicability of object recognition systems to real problems.

Fig. 1. Our representation of symmetric parts: In (a), the shape of the runner’s body
is transformed into its medial axis (red), a skeleton-like structure that decomposes
the shape into branch-like segments, e.g., the leg. The leg’s shape is swept out by a
sequence of discs (green) lying along the medial axis. In (b), the shape of the same
leg is composed from superpixels that correspond to the sequence of discs. The scope of
this article’s framework is limited to detecting symmetric parts corresponding to
individual branches.

A parallel advance in perceptual grouping, however, did not occur for a simple reason:
with the target object already known, indexing is not required to select it, and
perceptual grouping is not required to construct a discriminative shape index. As a
result, perceptual grouping activity at major conferences has diminished along with the
supporting role of symmetry [15], [16]. However, there are clear signs that the object
recognition community is moving from appearance back to shape, and from object
detection back to multiclass object categorization. Shape-based perceptual grouping
will play a critical role in facilitating this transition.

In attempting to bring back medial symmetry in support of perceptual grouping, we
observe that the subcommunity's efforts have not kept pace with mainstream object
recognition. Specifically, medial symmetry approaches typically assume that the input
image is a foreground object free from occluding and background objects and,
accordingly, lack the ability to segment foreground from background, an ingredient
crucial for tackling contemporary datasets. It is clear that the MAT cannot be
reintroduced without combining it with an approach for figure-ground segmentation. In
this article, we review current work along this trajectory as represented by Lee et al.
(2013) [1] and earlier work by Levinshtein et al. (2009) [2], [3].

In the context of symmetric part detection, 0 introduced an approach that leveraged
earlier work 0 to build a MAT-based superpixel grouping method. Since the proposed
representation is central to our approach, we proceed with a brief overview of the
latter. A bottom-up method was introduced which first detected symmetric parts, then
grouped them non-accidentally to form a discriminative shape index. By establishing a
correspondence between superpixels and maximally inscribed discs, the method formulated
a superpixel grouping problem that exploited symmetry as a grouping cue. The method
thus recovered symmetric parts by grouping superpixels that represented discs of the
same part. An evaluation was presented to show a significant improvement over other
symmetry-based approaches.

Subsequently, [1] furthered the development of the above ideas on two complementary
fronts. First, the medial representation was used to derive a sequence optimization
problem for grouping, whose solution was shown to bring significant improvements in
results. The approach uses a grouping algorithm that is principled and more effective
than in [2]. Second, symmetry was captured more accurately by increasing the number of
model parameters. While a limited number of parameters previously captured scale and
orientation, the method's invariance was improved by additionally capturing bending and
tapering. The resulting affinity function was also shown to improve results.

This article takes a high-level view of the work in reintroducing the MAT with
figure-ground segmentation capability, enabling us to draw insights from a higher
vantage point. We first develop the necessary background to trace the development from
its origins in the MAT, through [2], and finally to [1]. In doing so, we establish a
framework that makes clear the connections among previous work. For example, it follows
from our exposition that [2] is an alternative instance of our framework. More
generally, our unified framework benefits from the rich structure of the MAT while
directly tackling the challenge of segmenting out background noise in a cluttered
scene. Our model is discriminatively trained and stands out from typical perceptual
grouping methods that use predefined grouping rules. Using experimental image data, we
present both qualitative results and a quantitative metric evaluation to support the
development of the components of our approach.

2 Related work 

Symmetry is one of several important Gestalt cues that contribute to perceptual
grouping. Symmetry plays neither an exclusive nor an isolated role in the presence of
other cues. Contour closure, for example, is another mid-level cue whose role will
increase as the community relies more on bottom-up segmentation in the absence of a
strong object prior, e.g., [17]. Symmetry may also be effectively combined with other
mid-level cues, e.g., [18], [19]. For brevity, we restrict our survey of related work
to symmetry detection.

The MAT, along with its many descendant representations such as the shock graph [12],
[11], [20], [21] and bone graph [22], [23], provides an elegant decomposition of an
object's shape into symmetric parts; however, it makes the unrealistic assumption that
the shape has been segmented, and is thus not directly suitable for today's image
domains. For symmetry approaches in the cluttered image domain, we first consider the
filter-based approach, which first attempts to detect local symmetries, in the form of
parts, and then finds non-accidental groupings of the detected parts to form indexing
structures. Example approaches in this domain include the multiscale peak paths of
Crowley & Parker (1984) [24], the multiscale blobs of Shokoufandeh et al. (1999) [25],
the ridge detectors of Mikolajczyk & Schmid (2002) [26], and the multiscale blobs and
ridges of Lindeberg & Bretzner (2003) [27] and Shokoufandeh et al. (2006) [28].
Unfortunately, these filter-based approaches yield many false positive and false
negative symmetric part detections, and the lack of explicit part boundary extraction
makes part attachment detection unreliable.

A more powerful filter-based approach was recently proposed by Tsogkas & Kokkinos
(2012) [29], in which integral images are applied to an edge map to efficiently compute
discriminating features, including a novel spectral symmetry feature, at each pixel at
each of multiple scales. Multiple instance learning is used to train a detector that
combines these features to yield a probability map which, after non-maximum
suppression, yields a set of medial points. The method is computationally intensive yet
parallelizable, and the medial points still need to be parsed and grouped into parts.
Nevertheless, the method shows promise in recovering an approximation to a medial axis
transform of an image.

The contour-based approach is a less holistic approach that addresses the combinatorial
challenge of grouping extracted contours. Examples include Brady & Asada (1984) [30],
Connell & Brady (1987) [31], Ponce (1990) [32], Cham & Cipolla (1995, 1996) [33], [34],
Saint-Marc et al. (1993) [35], Liu et al. (1998) [36], Yla-Jaaski & Ade (1996) [37],
Stahl & Wang (2008) [38], and Fidler et al. (2014) [39]. Since these methods are
contour-based, they have to deal with the computational complexity of contour grouping,
particularly when cluttered scenes contain many extraneous edges. Some require smooth
contours or initialization, while others were designed to detect symmetric objects and
cannot detect and group the symmetric parts that make up an asymmetric object. A more
recent line of methods extracts interest point features, such as SIFT [40], and groups
them across an unknown symmetry axis [41], [42]. While these methods exploit
distinctive pairwise correspondences among local features, they critically depend on
reliable feature extraction.

A recent approach by Narayanan and Kimia [43] proposes an elegant framework for
grouping medial fragments into meaningful groups. Rather than assuming a figure-ground
segmentation, the approach computes a shock graph over the entire image of a cluttered
scene, and then applies a sequence of medial transforms to the medial fragments,
maintaining a large space of grouping hypotheses. While the method compares favorably
to figure-ground segmentation and fragment generation approaches, the high
computational complexity of the approach restricts it to images with no more than 20
contours.

Our approach, represented in the literature by [1], [2], [3], is qualitatively
different from both filter-based and contour-based approaches, offering a region-based
approach which perceptually groups together compact regions (segmented at multiple
scales using superpixels) representing deformable maximal discs into symmetric parts.
We note that while [2] has an additional step that groups symmetric parts into full
objects, the scope of our framework is limited to detecting symmetric parts. In doing
so, we avoid the low precision that often plagues the filter-based approaches, along
with the high complexity that often plagues the contour-based approaches.

3 Representing symmetric parts 

Our approach rests on the combination of medial symmetry and superpixel grouping [1],
[2], [3], and in this section we formally connect the two ideas. We proceed with the
medial axis transform (MAT) [7] of an object's shape, as illustrated with the runner in
Figure 1. The set of maximally inscribed discs plays the central role: their centers
(called medial points) trace out the skeleton-like medial axis of the object. We can
identify the object's parts by decomposing the medial axis into its branch-like linear
segments. We note that each object part is swept out by the sequence of maximally
inscribed discs along the corresponding segment of the medial axis. For details on the
relationship between the medial axis and the simpler reflective axis of symmetry, see
Siddiqi et al. (2008) [13].

The link between discs and superpixels is established by recently developed approaches
that oversegment an image into superpixels of compact and uniform scale. In order to
view superpixels as discs, we imagine removing the circular constraint on discs and
allowing them to deform to the boundary, just as superpixels are attracted to parts'
boundaries, resulting in "deformable discs". We will henceforth use the terms
"superpixel" and "disc" interchangeably. The disc's shape deforms to the boundary
provided that it remains compact (not too long and thin), resulting in a subregion that
aligns well with the part's boundary on either side, when such a boundary exists.
Because a deformable disc follows the boundary rather than being only bitangent to it,
like the maximal disc shown in Figure 1, far fewer deformable discs than maximal discs
are required to compose a part's shape.

Fig. 2. To compose a part’s shape from superpixels in a given input image (a), we
compute superpixels at multiple scales (b-c), and combine superpixels from different
scales (d).

In an input image domain of cluttered scenes, the vast majority of superpixels will not
correspond to true discs of an object's parts, and thus it is suitable to treat
superpixels as a set of candidate discs. Furthermore, a superpixel that is too fine or
too coarse for a given symmetric part fails to relate its opposing boundaries together
into a true disc, and a tapered part may be composed of discs of different sizes, as
shown in Figure 2. Since we have no prior knowledge of a part's scale, and an input
image may contain object parts of different scales, we compute superpixels at different
scales, and take their union as a set of candidate discs.

Our goal is to perceptually group discs that belong to the same part. To facilitate
grouping decisions, we will define a pairwise affinity function to capture
non-accidental relations between discs. Since the vast majority of superpixels will not
correspond to true discs, however, we must manage the complexity of the search space.
By restricting affinities to adjacent discs, we exploit one of the most basic grouping
cues, namely, proximity, which dictates that nearby discs are more likely to belong to
the same medial part. We enlist the help of more sophisticated cues, however, to
separate those pairs of discs that belong to the same part from those that do not.
Viewing superpixels as discs allows us to directly exploit the structure of medial
symmetry to define the affinity. In Section 4, we motivate and define the affinity
function from perceptual grouping principles to set up a weighted graph $\mathcal{G}$
of disc candidates. In Section 5, we discuss alternative graph-based algorithms for
grouping discs into medial parts. Section 6 presents qualitative and quantitative
experiments, while Section 7 draws some conclusions about the framework.
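
As a rough illustration of the candidate set and the proximity restriction described above, the following Python sketch takes the union of superpixels over several scales and links only discs that touch. It is a minimal sketch under stated assumptions: the label-map input format, the 4-connected notion of adjacency, and the names `candidate_discs`, `are_adjacent`, and `adjacency_edges` are hypothetical and are not taken from the reviewed work.

```python
from itertools import combinations


def candidate_discs(label_maps):
    """Union of superpixels over all scales.

    label_maps: list of 2D lists of integer labels, one per scale (assumed format).
    Returns a list of discs, each a frozenset of (row, col) pixels.
    """
    discs = []
    for labels in label_maps:
        regions = {}
        for r, row in enumerate(labels):
            for c, lab in enumerate(row):
                regions.setdefault(lab, set()).add((r, c))
        discs.extend(frozenset(pixels) for pixels in regions.values())
    return discs


def are_adjacent(a, b):
    """True if two regions are disjoint and share a 4-connected boundary."""
    if a & b:  # overlapping candidates (e.g. across scales) are not neighbours
        return False
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return any((r + dr, c + dc) in b for r, c in a for dr, dc in steps)


def adjacency_edges(discs):
    """Index pairs (i, j), i < j, on which a pairwise affinity would be defined."""
    return [(i, j) for i, j in combinations(range(len(discs)), 2)
            if are_adjacent(discs[i], discs[j])]
```

The all-pairs test is quadratic in the number of discs and is written for clarity only; a practical implementation would presumably scan region boundaries instead.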

4 Disc affinity 

Since bottom-up grouping is category-agnostic, a supporting disc affinity must
accommodate variations across objects of all types. The affinity $A(d_i, d_j)$ between
discs $d_i$ and $d_j$ must be robust against variability not only within object
categories, but also between object categories. For a discriminatively trained
affinity, it is helpful to extract features that reduce the variability for the
classifier. In the following sections we define both shape and appearance features on
the region scope defined by $d_i$ and $d_j$.

4.1 Shape features 

The local shape of discs is captured by a spatial histogram of gradient pixels, as
illustrated in Figure 3. By encoding the distribution of the boundary edgels of the
region defined by the union of the two discs, we capture mid-level shape while avoiding
features specific to the given exemplar. This representation offers us a degree of
robustness that is helpful for training the classifier. However, it is not perfect: it
remains sensitive to variations such as scale and orientation, to name a few, and can
thus allow the classifier to overfit to training examples.

We turn to medial symmetry to capture these unwanted variations, as the first step in
making the feature invariant to such changes. Specifically, we locally model the shape
by fitting the parameters of a symmetric shape to the region. We refer to a vector
$\mathbf{w}$ of warping parameters that subsequently define a warping function
$W : \mathbb{R}^2 \to \mathbb{R}^2$ that is used to remove the variations from the
space, in effect normalizing the local coordinate system. Figure 3 visualizes the
parameters $\mathbf{w}$ of a deformable ellipse fit to a local region, the medial axis
before and after the local curvature was "warped out" from the coordinate system, and
the spatial histogram computed on the normalized coordinate system.

Before describing the spatial histogram in detail, we discuss a class of ellipse-based
models for modeling the local medial symmetry. Ellipses represent ideal shapes of an
object's parts, and in particular are shapes that are symmetric about their major axes.
A standard ellipse is parameterized by $\mathbf{w}_e = (p, \theta, a)$, where $p$
denotes its position, $\theta$ its orientation, and $a = (a_x, a_y)$ the lengths of its
major and minor axes. The parameter vector $\mathbf{w}_e$ is analytically fit to the
local region and is used to define the corresponding warping function
$W_e(\mathbf{w}_e)$.
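
To make the role of the ellipse parameters concrete, the sketch below builds a warp from position, orientation, and axis lengths that moves image points into an ellipse-centered, axis-aligned frame. This is only one plausible reading of the warping function: the text does not give its exact form, so the function name `make_ellipse_warp`, the `normalize_scale` flag, and the choice to leave scale to the histogram grid by default are assumptions.

```python
import math


def make_ellipse_warp(p, theta, a, normalize_scale=False):
    """Return a function mapping image points into the ellipse's local frame.

    p: (x, y) ellipse center; theta: orientation of the major axis in radians;
    a: (a_x, a_y) axis lengths. With normalize_scale=True the coordinates are
    also divided by the axis lengths (an assumed variant).
    """
    px, py = p
    ax, ay = a
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def warp(x, y):
        dx, dy = x - px, y - py          # remove position
        u = cos_t * dx + sin_t * dy      # coordinate along the major axis
        v = -sin_t * dx + cos_t * dy     # coordinate along the minor axis
        if normalize_scale:
            u, v = u / ax, v / ay        # remove scale
        return u, v

    return warp
```

For example, with center (2, 3), orientation pi/2, and axis lengths (4, 1), the image point (2, 7) lies at the tip of the major axis and maps to approximately (4, 0), or to (1, 0) when scale is also normalized.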

Historically, we first obtained the warping parameters with an ellipse [2]. While the
advantages of using the ellipse lie in its simplicity and ease of fitting, shortcomings
were identified in its tendency to provide too coarse a fit to the boundary to yield an
accurate enough warping function. Accordingly, in [1] we added deformation parameters
to obtain a better overall fit across all examples. Despite a higher computational cost
of fitting, the deformable model was shown to yield quantitative improvements.

Fig. 3. Improving invariance with a deformable ellipse: given two adjacent candidate
discs, the first step is to fit the ellipse parameters to the region defined by their
corresponding superpixels in (a). The top row shows invariance achieved with a standard
ellipse. The ellipse’s fit is visualized with the major axis in (b), the region’s
boundary edgels before (c) and after (d) warping out the unwanted variations, and the
resulting spatial histogram of gradient pixels (e). See text for details. The bottom
row shows the corresponding steps (f-i) obtained by the deformable ellipse. Comparing
the results, a visually more symmetric feature is obtained by the deformable ellipse,
which fits tightly around the region’s boundary as compared with the standard ellipse.

Specifically, we obtain invariance to bending and tapering deformations by augmenting
the ellipse parameters as follows: $\mathbf{w}_d = (p, \theta, a, b, t)$, with the
bending radius $b$ along the major axis and tapering slope $t$ along the major axis.
The parameter vector $\mathbf{w}_d$ is fit by initializing as a standard ellipse and
iteratively fitting it to the local region's boundary with a non-linear least-squares
algorithm. The fitted parameters are then used to define the warping function
$W_d(\mathbf{w}_d)$ corresponding to the deformable ellipse.

Only once the warping function $W(\mathbf{w})$ is fit to the local region and applied
to normalize the local coordinate system do we compute the spatial histogram feature.
We place a $10 \times 10$ grid on the warped region, and focusing on the model fit to
the union of the two discs, we scale the grid to cover the area
$[-1.5a_x, 1.5a_x] \times [-1.5a_y, 1.5a_y]$. Using the grid, we compute a 2D histogram
on the normalized boundary coordinates weighted by the edge strength of each boundary
pixel. Figure 3 illustrates the shape feature computed for the disc pair. We train an
SVM classifier on this 100-dimensional feature using our manually labeled superpixel
pairs, labeled as belonging to the same part or not. The margin from the classifier is
fed into a logistic regressor in order to obtain the shape affinity
$A_{shape}(d_i, d_j)$ in the range $[0, 1]$.
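
The histogram step can be sketched in a few lines of Python. The code below bins already-warped boundary points on a 10 x 10 grid spanning 1.5 times the axis lengths in each direction, weights each point by its edge strength, and squashes a classifier margin into [0, 1] with a logistic function. It is a sketch under assumptions: the function names, the handling of points outside the grid (dropped here), and the logistic parameters `slope` and `offset` are hypothetical, and the SVM training itself is not shown.

```python
import math


def spatial_histogram(points, weights, ax, ay, bins=10, extent=1.5):
    """Edge-strength-weighted 2D histogram of warped boundary points.

    points: iterable of (u, v) coordinates in the normalized frame;
    weights: matching edge strengths; (ax, ay): fitted axis lengths.
    Returns a flat list of bins * bins values (100 for the default grid).
    """
    x_lo, x_hi = -extent * ax, extent * ax
    y_lo, y_hi = -extent * ay, extent * ay
    hist = [[0.0] * bins for _ in range(bins)]
    for (u, v), w in zip(points, weights):
        if not (x_lo <= u < x_hi and y_lo <= v < y_hi):
            continue  # assumption: points outside the grid are ignored
        col = int((u - x_lo) / (x_hi - x_lo) * bins)
        row = int((v - y_lo) / (y_hi - y_lo) * bins)
        hist[row][col] += w
    return [value for row in hist for value in row]


def margin_to_affinity(margin, slope=1.0, offset=0.0):
    """Map a classifier margin to (0, 1); slope and offset would be learned."""
    return 1.0 / (1.0 + math.exp(-(slope * margin + offset)))
```

A margin of zero maps to an affinity of 0.5 under the default parameters, and larger positive margins approach 1.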

4.2 Appearance features 

Aside from medial symmetry, we include appearance similarity as an additional grouping
cue. While object parts may vary widely in color and texture, regions of similar
appearance tend to belong to the same part. We extract an appearance feature on the
discs $d_i, d_j$ that encodes their dissimilarity in color and texture. Specifically,
we compute the absolute difference in mean RGB color, absolute difference in mean HSV
color, RGB and HSV color variances in both discs, and histogram distance in HSV space,
yielding a 27-dimensional appearance feature. To improve classification, we compute
quadratic kernel features, resulting in a 406-dimensional appearance feature. We train
a logistic regressor with L1-regularization to prevent overfitting on a relatively
small dataset, while emphasizing the weights of more important features. This yields an
appearance affinity function between two discs, $A_{app}(d_i, d_j)$. Training the
appearance affinity is easier than training the shape affinity. For positive examples,
we choose pairs of adjacent superpixels that are contained inside a figure in the
figure-ground segmentation, whereas for negative examples, we choose pairs of adjacent
superpixels that span figure-ground boundaries.
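
The jump from 27 to 406 dimensions is consistent with a full quadratic expansion: one constant term, the 27 original features, and the 27 * 28 / 2 = 378 products of feature pairs (squares included), for 1 + 27 + 378 = 406. The text does not spell out the expansion, so the sketch below is one expansion that matches the stated dimensions rather than a confirmed description; the name `quadratic_features` is hypothetical.

```python
from itertools import combinations_with_replacement


def quadratic_features(x):
    """Constant, linear, and all pairwise-product terms of a feature vector."""
    feats = [1.0]                       # constant (bias) term
    feats.extend(x)                     # linear terms
    feats.extend(a * b for a, b in combinations_with_replacement(x, 2))
    return feats
```

Applied to a 27-dimensional input, this returns a list of length 406.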

We combine the shape and appearance affinities using a logistic regressor to obtain the
final pairwise affinity $A(d_i, d_j)$. Both the shape and the appearance affinities, as
well as the final affinity $A(d_i, d_j)$, were trained with a regularization parameter
of 0.5 on the L1-norm of the logistic coefficients.

5 Grouping discs 

Given a graph $\mathcal{G}$ of discs weighted by affinities, the final step is to group
discs that belong to the same symmetric part. If two adjacent discs correspond to
medial points belonging to the same medial axis, they can be combined to extend the
symmetry. This is the basis for defining the pairwise affinities in $\mathcal{G}$, and
it is how we exploit our medial representation of symmetric parts for grouping.
Specifically, the affinity between two adjacent discs reflects the degree to which it
is believed that they not only non-accidentally relate the two opposing boundaries
together, but that they are centered along the same medial axis. In this section, we
adapt and discuss two alternative graph-based algorithms, namely, the agglomerative
clustering algorithm of Felzenszwalb & Huttenlocher (2004) [44], and the
sequence-finding algorithm in the salient curve detection method of Felzenszwalb &
McAllester (2006) [45].

Fig. 4. In our approach, the (a) input image of foreground leaves is oversegmented into
superpixels, and a (b) weighted graph $\mathcal{G}$ is built that captures the pairwise
affinities that are computed among the superpixels. A graph-based grouping algorithm
takes as input the graph $\mathcal{G}$, which may contain false positive affinities
between the leaves, as shown in (b). In this figure, we illustrate the relative
advantage of (d) sequence optimization over (c) agglomerative clustering. In (c),
merging the vertices in $\mathcal{G}$ results in a cluster that undersegments the
leaves, combining them into a single symmetric part that violates the assumption that a
part is composed of a linear sequence of discs. In (d), the branching constraint is
built into the sequence-finding algorithm, which prevents symmetric parts from having
tree-structured discs, and correctly segments the leaves into two distinct parts.

5.1 Agglomerative clustering 

Our first grouping approach is based on agglomerative clustering [44]. The algorithm
takes as input the weighted graph $\mathcal{G}$ and merges edges in increasing order of
weights. Each merge represents a grouping of discs, and the connected components that
result correspond to symmetric parts. Grouping is performed efficiently in
$O(e \log e)$ time, where $e$ is the number of edges in $\mathcal{G}$. We refer the
reader to [2] for details on the algorithm's adaptation to the setting of grouping
discs.
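
The merging procedure can be sketched with a union-find structure. The code below sorts edges by weight and merges two components when the edge weight does not exceed either component's internal variation plus a size-dependent slack, following the general merge criterion of the graph-based segmentation algorithm in [44]. It is a sketch under assumptions, not the adaptation described in [2]: edge weights are taken to be dissimilarities (for instance one minus the affinity), and the function name and the constant `k` are hypothetical.

```python
def agglomerative_grouping(num_discs, edges, k=1.0):
    """Group discs by merging edges in increasing order of weight.

    edges: list of (weight, i, j) with weight a dissimilarity (assumed).
    k: slack constant controlling how readily small components merge.
    Returns the groups as sorted lists of disc indices.
    """
    parent = list(range(num_discs))
    size = [1] * num_discs
    internal = [0.0] * num_discs   # largest edge weight merged into a component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for w, i, j in sorted(edges):           # O(e log e) from the sort
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        threshold = min(internal[ri] + k / size[ri],
                        internal[rj] + k / size[rj])
        if w <= threshold:
            parent[rj] = ri
            size[ri] += size[rj]
            internal[ri] = w                # edges arrive in increasing order

    groups = {}
    for i in range(num_discs):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Nothing in this procedure prevents a component from growing sideways: any sufficiently cheap edge is merged, regardless of whether the resulting component is still a linear chain of discs.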


[Figure 5: a sequence $D^* = (d_0, d_1, d_2, d_3)$ and its candidate extensions $d_4$.]

Fig. 5. Grouping by dynamic programming: The iterative step of the algorithm grows
sequences by extracting a sequence $D^*$ from the priority queue, and returning longer
sequences to the queue obtained by extending the end of $D^*$ with adjacent discs. See
text for details.


The greedy approach, while fast, is unfortunately underconstrained in allowing merges
to occur between branch-structured clusters, resulting in tree-like clusters as
illustrated in Figure 4. These types of clusters can occur as frequently as spuriously
high affinity values (false positives) occur, thus motivating the need to constrain the
growth of clusters within medial branches.


5.2 Sequence optimization by dynamic programming 

Our second approach is the dynamic programming used in [1], which observes that each
symmetric part is swept out by an ordered sequence of discs. Discs along the same
medial axis are thus not only combined in pairs, but can be traced out linearly. This
allows us to reformulate the problem of superpixel grouping as finding sequences of
discs in a weighted graph $\mathcal{G}$ that belong to the same symmetric part. We thus
obtain a grouping approach in which the desired branching constraint is inherent in the
problem formulation. As illustrated in Figure 4, the algorithm applied to the same
graph prevents the resulting clusters from violating the branching constraint.

Before describing the steps of the dynamic programming algorithm, we note that it
solves a discrete optimization problem, and thus represents a principled reformulation
of our grouping problem. This includes defining an objective function that captures the
goal of the problem, which is missing from the first approach, and making use of
dynamic programming that efficiently solves for a global optimum. We specifically
borrow from the optimization framework used for salient curve detection in Felzenszwalb
& McAllester [45] and adapt it for symmetric part detection.

The application of [45] to our setting is best explained via their method for curve
detection. The method takes as input a graph with weights defined on edges, and
"transition weights" defined on pairs of adjacent edges. A salient curve is modeled as
a valid sequence of edges, and a regularized cost function is defined on valid
sequences that includes a normalized sum of the weights along the given sequence.
Salient curves are found by globally optimizing the cost function using a dynamic
programming algorithm.

In our setting, the graph $\mathcal{G}$ supplies weights between adjacent discs, and we
define a valid sequence of discs (of variable length) by $D = (d_0, d_1, \ldots, d_n)$,
which represents a symmetric part. The criteria that we want to optimize, namely good
symmetry along the medial axis and a maximally long axis, are provided by the affinity
graph $\mathcal{G}$. The regularized cost function, cost($D$), is defined
correspondingly: it favors good internal affinity with a normalized sum over the
affinities along the given sequence, and encourages longer sequences with a
regularization term. Affinities defined over longer subsequences corresponding to the
transition weights have a smoothing effect on the preferred sequences. Details on the
cost function, including its mathematical form, can be found in [1].

We now summarize the dynamic programming steps for 
globally minimizing cost(D). The core step is 
illustrated in Figure 5, and details can be found in 
[45]. The algorithm initializes a priority queue Q of 
candidate sequences with all possible sequences of 
unit length, then pursues a best-first search 
strategy of iteratively extending the cheapest 
candidate sequences. Each edge $(d_{i-1}, d_i)$ is 
directed such that a sequence of edges terminating at 
$d_i$ can be extended with an edge starting at $d_i$. 
At each iteration, as shown in Figure 5, the most 
promising sequence D* is removed from Q, and new 
candidate sequences are proposed by extending the end 
of D* with adjacent discs. If an extended sequence 
ending at an edge improves on the cost of an existing 
sequence ending at the same edge, it is added back 
into Q. To find multiple sequences from the graph 
corresponding to different symmetric parts, we 
iteratively remove sequences that are already found 
and re-minimize the cost, until a maximum cost is 
reached. 
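
The search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the cited work, and it does not reproduce that algorithm's optimality guarantees. The function names, the `(total + reg) / n_edges` cost, the rule that a disc appears at most once per sequence, and the removal of all edges touching a found sequence are hypothetical choices.

```python
import heapq
from math import inf


def best_sequence(edge_cost, reg=1.0):
    """Best-first search for a cheap disc sequence (illustrative sketch).

    edge_cost: dict mapping a directed disc pair (u, v) to a pairwise cost,
               where lower means more symmetric.
    """
    successors = {}
    for u, v in edge_cost:
        successors.setdefault(u, []).append(v)

    def cost(total, n_edges):
        # Assumed regularized cost: normalized sum plus a length bonus.
        return (total + reg) / n_edges

    best_at_edge = {}   # cheapest known sequence cost ending at each edge
    queue = []          # entries: (cost, summed edge cost, disc tuple)
    for (u, v), w in edge_cost.items():
        best_at_edge[(u, v)] = cost(w, 1)
        heapq.heappush(queue, (cost(w, 1), w, (u, v)))

    best_cost, best_seq = inf, None
    while queue:
        c, total, seq = heapq.heappop(queue)
        if c > best_at_edge[seq[-2:]]:
            continue  # superseded by a cheaper sequence ending at this edge
        if c < best_cost:
            best_cost, best_seq = c, seq
        last = seq[-1]
        for nxt in successors.get(last, []):
            if nxt in seq:
                continue  # assumption: each disc is used at most once
            new_total = total + edge_cost[(last, nxt)]
            new_cost = cost(new_total, len(seq))  # len(seq) edges after extending
            if new_cost < best_at_edge[(last, nxt)]:
                best_at_edge[(last, nxt)] = new_cost
                heapq.heappush(queue, (new_cost, new_total, seq + (nxt,)))
    return best_seq, best_cost


def all_sequences(edge_cost, max_cost, reg=1.0):
    """Repeatedly extract the best sequence until its cost exceeds max_cost."""
    remaining, found = dict(edge_cost), []
    while remaining:
        seq, c = best_sequence(remaining, reg)
        if c > max_cost:
            break
        found.append((seq, c))
        used = set(seq)
        # Assumption: drop every edge that touches a disc already used.
        remaining = {e: w for e, w in remaining.items() if not used & set(e)}
    return found


# A chain 0-1-2-3 of well-matched discs, a poor branch 1-4, and a separate pair 5-6.
graph = {(0, 1): 0.1, (1, 2): 0.1, (2, 3): 0.1, (1, 4): 0.9, (5, 6): 0.2}
parts = all_sequences(graph, max_cost=2.0)
```

In this toy graph the chain is recovered first as a single part, and the isolated pair is returned as a second, costlier part.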


6 Results 

We present an evaluation of our approach, first 
qualitatively in Section 6.1, then quantitatively in 
Section 6.2. Our qualitative results are drawn from 
sample input images and illustrate particular 
strengths and weaknesses of our approach. In our 
quantitative evaluation, we use performance metrics 
on two different datasets to gauge the contributions 
of different components in our approach. Figure 6 
visualizes detected masks returned by our method, 
specifically showing the top 15 detected parts on 
sample input images. Parts are ranked by the 
optimization objective function. On each part's mask, 
we indicate the associated disc centers and the 
medial axis via connecting line segments. All results 
reported are generated with superpixels computed 
using normalized cuts [46], at multiple scales 
corresponding to 25, 50, 100, and 200 superpixels per 
image. 

Our evaluation employs two image datasets of 
cluttered scenes. The first dataset is a subset of 81 images 




Fig. 6. Multiple symmetric parts: for each image (a), 
(b) below we show the top 15 masks detected as 
symmetric parts. Each mask is detected as a sequence 
of discs, whose centers are plotted in green and 
connected by a sequence of red line segments that 
represents the medial axis. 


from the Weizmann Horse Database (WHD) [43], in 
which each image contains one or more horses. Aside 
from color variation, the dataset exhibits variations in 
scale, position, and articulation of horse joints. The 
second dataset was created by Lee et al. (U from the 
Berkeley Segmentation Database (BSD) [48]. This set is denoted 










Fig. 7. Example detections on a sample of images 
from BSD-Parts. Columns left to right: input image, 
ground-truth masks, top 4 detection masks. Note that 
many images have more ground-truth masks than 
detections that can be shown here. 

as BSD-Parts and contains 36 BSD images which are 
annotated with ground-truth masks corresponding to 
the symmetric parts of prominent objects (e.g., duck, 
horse, deer, snake, boat, dome, amphitheater). This 
set contains a variety of natural and artificial objects 
and offers a balancing counterpart to the horse dataset. 

Both WHD and BSD-Parts are annotated with 
ground-truth masks corresponding to object parts in 
the image. The learning component of our approach 
requires ground-truth masks as input, for which we 
have held out a subset of training images from 
testing. Specifically, we trained our classifier on 20 
WHD images and used for evaluation the remaining 
61 WHD images and all 36 BSD-Parts images. This 
methodology supports a key point of our approach, 
which is that of mid-level transfer: by increasing feature 
invariance against image variability, we help prevent 
the classifier from overfitting to the objects on which it 
is trained. By training our model on horse images and 
applying it to other types of objects, we thus 
demonstrate the ability of our model to transfer symmetric 
part detection from one object class to another. 

6.1 Qualitative results 

Figure 7 presents our results on a sample of input 
images. For each image, the set of ground-truth masks 
is shown, followed by the top several detection 
masks. (Detection masks are indicated with the 
associated sequence of discs.) For clarity, individual 
detections are shown in separate images. The tiger image 
demonstrates successful detection of its parts, which 
vary in curvature and taper. In the next example, 
vertical segments of the Florentine dome are detected 
by the same method. The next example shows 
recovered parts of the boat. When suitably pruned, a 
configuration of parts hypothesized from a cluttered 
image can provide an index into a bank of part-based 
shape models. 

In the image of the fly, noise along the abdomen was 
captured by the affinity function at finer superpixel 
scales, resulting in multiple overlapping 
oversegmentations. The leaf was not detected, however, 
due to its symmetry being occluded. In the first snake 
image, low contrast along its tail yielded imperfect 
superpixels that could not support correct segmentation; 
nevertheless, the detection shows invariance to bending. 
The second snake is accompanied by a second thin 
detection along its shadow. We conclude with the 
starfish, whose complex texture did not pose a difficult 
challenge for our method. These examples illustrate 
that symmetry is a powerful shape regularity that is 
ubiquitous across different objects. 

6.2 Quantitative results 

In the quantitative part of our evaluation, we use 
standard dataset metrics to evaluate the components 
of our approach. Specifically, we demonstrate the 
improvement contributed by formulating grouping 
as sequence optimization, and by using invariant 
features to train the classifier. Results are computed 
on the subset of WHD held out from training, and 
on BSD-Parts. To evaluate the quality of our detected 
symmetric parts, we compare them in the form of 
detection masks to the ground-truth masks using the 
standard intersection-over-union (IoU) metric. A 
detection mask $m_{det}$ is counted as a hit if its overlap 
with the ground-truth mask $m_{gt}$ is greater than 0.4, 
where overlap is measured by 
IoU $= |m_{det} \cap m_{gt}| / |m_{det} \cup m_{gt}|$. 
We obtain a precision-recall curve by varying the 
threshold over the cost (weight) of detected parts. 
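
The evaluation protocol above can be sketched in a few lines. This is an illustration under assumptions rather than the evaluation code used for the reported results: masks are represented as sets of (row, column) pixels, and a detection is matched against any ground-truth mask without enforcing a one-to-one assignment. The function names and the toy masks are hypothetical.

```python
def iou(m_det, m_gt):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = m_det | m_gt
    if not union:
        return 0.0  # two empty masks: define the overlap as zero
    return len(m_det & m_gt) / len(union)


def precision_recall(detections, gt_masks, cost_threshold, min_iou=0.4):
    """detections: list of (cost, mask) pairs, where lower cost ranks higher."""
    kept = [mask for cost, mask in detections if cost <= cost_threshold]
    # A detection is a hit if it overlaps some ground-truth mask well enough.
    hits = sum(any(iou(m, g) > min_iou for g in gt_masks) for m in kept)
    # A ground-truth part counts as recalled if some kept detection hits it.
    found = sum(any(iou(m, g) > min_iou for m in kept) for g in gt_masks)
    precision = hits / len(kept) if kept else 1.0  # convention for no detections
    recall = found / len(gt_masks) if gt_masks else 1.0
    return precision, recall


def rect(r0, r1, c0, c1):
    """Toy rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


gt = [rect(0, 2, 0, 4)]
detections = [(0.2, rect(0, 2, 0, 3)),   # overlaps the part: IoU = 6/8
              (0.8, rect(2, 4, 0, 4))]   # misses the part: IoU = 0
# Sweeping the cost threshold traces out points of a precision-recall curve.
curve = [precision_recall(detections, gt, t) for t in (0.1, 0.5, 1.0)]
```

Raising the cost threshold admits more detections, which here keeps recall at its maximum while lowering precision once the spurious detection is included.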

Figure 8 presents the performance curves 
corresponding to 4 different settings under our 
framework, evaluated on both WHD and BSD-Parts: 1) 
ellipse+clustering combines the ellipse-warped 
affinity with agglomerative clustering and corresponds 
to 0. We note that low precision is partly due to 
the lack of annotations on many background objects 
in both datasets; 2) ellipse+sequences combines the 
ellipse-warped affinity with sequence optimization; 3) 
deform+sequences combines deformable warping with 
sequence optimization, and corresponds to (fl; and 4) 
deform+unsmooth sets the triplewise weights in cost(D) 
uniformly to zero rather than using the affinity as 
done in the previous setting. A corresponding drop in 
performance shows that smoothness is an important 
feature of symmetric parts. In summary, experimental 

Fig. 8. Performance curves corresponding to different 
settings of the components of our approach on (a) 
BSD-Parts and (b) WHD. See text for details. 


results confirm that both the added deformations and 
sequence optimization are individually effective at 
improving the accuracy of our approach. 

7 Conclusion 

Symmetry figured prominently in early object 
recognition systems, but the potential of this powerful 
cue is largely overlooked in contemporary computer 
vision. In this article, we have reviewed a framework 
that attempts to reintroduce medial symmetry into the 
current research landscape. The key concepts behind 
the framework are remodeling the discs of the MAT 
as compact superpixels, learning a pairwise affinity 
function between discs with a symmetry-invariant 
transform, and formulating a discrete optimization 
problem to find the best sequences of discs. We have 
summarized quantitative results that encourage 
further exploration of using symmetry for object 
recognition. 

We have reviewed ways in which we overcame the 
early limitations of our approach, such as using 
additional deformation parameters to improve warping 
accuracy, and reformulating grouping as a discrete 
optimization problem to improve results. There are 
also current limitations to be addressed in future 
work. To briefly mention two, we first note that the 
success of using Gestalt grouping cues such as 
symmetry depends on effectively combining multiple 
cues. To improve the robustness of our system, 
we are thus exploring how to incorporate additional 
mid-level cues such as contour closure. This will 
help our system more accurately resolve cases where 
different features provide conflicting cues, and thus 
improve the overall performance. Second, our scope 
is bottom-up detection and is thus agnostic to object 
categories. However, in a detection or verification 
task, top-down cues may be available. We are thus 
investigating ways of integrating top-down cues into 
our framework. 

In conclusion, we have reviewed an approach for 
reintroducing the MAT into contemporary computer 
vision, by leveraging the formulation of maximal 
discs as compact superpixels to derive a 
symmetry-based affinity function and grouping algorithms. 
Quantitative results encourage further development 
of the framework to recover medial-based parts from 
cluttered scenes. Finally, as initially explored in [3], 
detected parts must be non-accidentally grouped 
before they yield the distinctiveness required for object 
recognition. 